To construct out plots in R, we will be using the ggplot2 package. To perform data manipulation, we will be using the dplyr package.
library(dplyr)
library(ggplot2)
Rainplots are useful for summarizing the results of multiple models at the same time. In order to plot those results in ggplot2, those results must formatted properly. Specifically, the data must be organized in a data.frame with columns indicating:
An example of data organized in this way can be seen below. Other columns can be included in the data, but for the purpose of this tutorial we are focusing on the four necessary columns.
plot_data
## # A tibble: 168 x 4
## response term estimate p.value
## <chr> <chr> <dbl> <dbl>
## 1 Body Mass Index mzid_443.210809_1.7953 0.4 5.74e-52
## 2 Framingham Risk Score mzid_443.210809_1.7953 0.4 4.80e-46
## 3 Age mzid_349.202006_4.4691 0.3 1.87e-30
## 4 Female Sex mzid_373.205458_6.3700 0.6 2.47e-25
## 5 Metabolic Syndrome mzid_443.210809_1.7953 0.7 1.78e-23
## 6 Female Sex mzid_443.210809_1.7953 -0.5 3.92e-18
## 7 Age mzid_373.205458_6.3700 0.2 2.96e-17
## 8 Framingham Risk Score mzid_349.202006_4.4691 0.2 8.91e-17
## 9 Age mzid_313.238624_5.2172 0.2 1.02e-15
## 10 Body Mass Index mzid_361.236724_3.5879 -0.2 1.47e-13
## # … with 158 more rows
In this dataset, estimate and p.value indicate the regression estimate and the p.value of that estimate. term indicates the ID of the metabolite included in the model. response indicates which model that a term corresponds to. For example if response = Body Mass Index, this indicates that the regression estimate and the p.value correspond to the model where Body Mass Index was the response variable.
Before plotting, we will first transform p.value onto the negative log-scale. This allows smaller P-values, which are often of greater interest, to have larger values than P-values that might be of lesser interest.
plot_data <-
plot_data %>%
mutate(p.value = -1 * log10(p.value))
A basic rainplot can be constructed in only two lines of ggplot2 code!
rainplot <-
plot_data %>%
ggplot(aes(x = response, y = term)) +
geom_point(aes(colour = estimate, size = p.value))
rainplot
This is a good start, but we will want to clean up the visual presentation. We can do this by creating a custom ggplot2 theme and adjusting legend titles. One thing to ensure is to represent P-values (the size of the plotted points) by area instead of radius. When comparing two points of different size, humans perceive the area of points, not their radius, when comparing them. Thus a value that is twice another should have twice as much area, not double the radius. This is ensured by using scale_size_area.
thm <-
# Good starting theme + set text size
theme_light(base_size = 18) +
theme(
# Remove axis ticks and titles
axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
# Remove Gridlines and boxes
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_blank(),
legend.key = element_blank(),
# White backgrounds
panel.background = element_rect(fill = 'white'),
plot.background = element_rect(fill = 'white'),
legend.background = element_rect(fill = 'white'),
# Angle text
axis.text.x.top = element_text(angle = 45, hjust = 0)
)
rainplot <-
rainplot +
thm +
scale_x_discrete(position = 'top') +
scale_size_area(expression(paste(-log[10]('P-value')))) +
scale_color_continuous('Regression Estimate')
rainplot
To make the presentation of the regression estimates clearer, we create a diverging color scale, and set positive and negative limits equidistant from 0.
palette <-
c("#053061",
"#313695",
"#4575b4",
"#74add1",
"#abd9e9",
"#e0f3f8",
"#fee090",
"#fdae61",
"#f46d43",
"#d73027",
"#a50026",
'#67001f')
max_abs_estimate <- max(abs(plot_data$estimate))
max_lim <- max_abs_estimate
min_lim = -1 * max_lim
rainplot <- rainplot +
scale_color_gradientn(
'Regression Estimate',
colors = palette,
limits = c(min_lim, max_lim),
breaks = c(min_lim, min_lim / 2, 0 , max_lim/2, max_lim)
)
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.
rainplot
Another step to improve presentation is to increase the maximum size of each point There will be a bit of trial and error here; if the size threshold is too large, the points will overlap.
rainplot +
scale_size_area('P-value', max_size = 12)
## Scale for 'size' is already present. Adding another scale for 'size',
## which will replace the existing scale.
The points on rainplots can be outlined in a color different from the color of the point. To get such a shape, we add the argument shape = 21 to geom_point. Note that when we do this, the color of the point changes from color to fill.
rainplot <-
plot_data %>%
ggplot(aes(x = response, y = term)) +
geom_point(aes(fill = estimate, size = p.value), shape = 21) +
scale_fill_gradientn(
'Regression Estimate',
colors = palette,
limits = c(min_lim, max_lim),
breaks = c(min_lim, min_lim / 2, 0 , max_lim / 2, max_lim)
) +
scale_size_area(expression(paste(-log[10]('P-value'))), max_size = 12) +
scale_x_discrete(position = 'top') +
thm
rainplot
When a few p.values are much smaller than the majority of the data the plot loses size resolution in the range where most of the data lies. One possible solution is to set all P-values above some ceiling, here chosen to be 15, to the value of the ceiling. The threshold can be set at a level where one considers all P-values more extreme than the threshold to be ‘of interest’.
plot_data_thresholded <-
plot_data %>%
mutate(p.value = ifelse(p.value > 15, 15, p.value))
rainplot <-
plot_data_thresholded %>%
ggplot(aes(x = response, y = term)) +
geom_point(aes(colour = estimate, size = p.value)) +
scale_color_gradientn(
'Regression Estimate',
colors = palette,
limits = c(min_lim, max_lim),
breaks = c(min_lim, min_lim / 2, 0 , max_lim / 2, max_lim)
) +
scale_size_area(
expression(paste(-log[10]('P-value'))),
max_size = 12,
breaks = c(5, 10, 15),
labels = c('5', '10', '>=15')) +
scale_x_discrete(position = 'top') +
thm
rainplot
To make it easier to identify the metabolites that had small P-values in multiple models, we will convert the term variable into a factor variable ordered by the average P-value across all models. This will put metabolites with small P-values in multiples models at the top of the plot, and metabolites with large P-values in multiple models at the bottom of the plot.
# Order metabolites by average p-value
term_order <-
plot_data %>%
group_by(term) %>%
summarise(mpv = mean(p.value)) %>%
arrange(mpv) %>%
pull(term)
# Convert term to a factor, ordered by `term_order`
plot_data <-
plot_data %>%
mutate(term = factor(term, levels = term_order))
rainplot <-
plot_data %>%
ggplot(aes(x = response, y = term)) +
geom_point(aes(colour = estimate, size = p.value)) +
scale_color_gradientn(
'Regression Estimate',
colors = palette,
limits = c(min_lim, max_lim),
breaks = c(min_lim, min_lim / 2, 0 , max_lim / 2, max_lim)
) +
scale_size_area(expression(paste(-log[10]('P-value'))), max_size = 12) +
scale_x_discrete(position = 'top') +
thm
rainplot
Rainplots can be clustered like heatmaps. We will be using the hculst function to cluster the results by regression estimate. The term variable will be converted into an ordered factor, just as was done in Ordering by P-Value. In order to cluster the data, we will need to reshape it using the spread function from the tidyr package.
library(tidyr)
# Convert to matrix for clustering. `term` on the y-axis, `response` on the x-axis
cluster_data <-
plot_data %>%
select(response, term, estimate) %>%
spread(response, estimate)
rnms <-
cluster_data$term
cluster_data <-
cluster_data %>%
select(-term) %>%
as.matrix()
rownames(cluster_data) <- rnms
cluster_data[1:5, 1:5]
## Age Body Mass Index Female Sex
## mzid_357.209381_6.0558 0.1 0.1 0.2
## mzid_327.219668_2.9775 0.0 0.1 -0.1
## mzid_368.224901_6.5171 0.1 0.0 0.1
## mzid_347.224403_4.3446 0.1 0.1 -0.1
## mzid_299.259209_5.5568 0.1 -0.1 0.2
## Framingham Risk Score Incident CVD
## mzid_357.209381_6.0558 0.1 0.0
## mzid_327.219668_2.9775 0.1 0.0
## mzid_368.224901_6.5171 0.1 0.0
## mzid_347.224403_4.3446 0.1 0.0
## mzid_299.259209_5.5568 -0.1 0.1
clust <- hclust(dist(cluster_data), method = 'ward.D2')
# `clust$order` orders `term` into clusters
term_order <-
clust$labels[clust$order]
# Convert term to a factor, ordered by `term_order`
plot_data <-
plot_data %>%
mutate(term = factor(term, levels = term_order))
rainplot <-
plot_data %>%
ggplot(aes(x = response, y = term)) +
geom_point(aes(colour = estimate, size = p.value)) +
scale_color_gradientn(
'Regression Estimate',
colors = palette,
limits = c(min_lim, max_lim),
breaks = c(min_lim, min_lim / 2, 0 , max_lim / 2, max_lim)
) +
scale_size_area(expression(paste(-log[10]('P-value'))), max_size = 12) +
scale_x_discrete(position = 'top') +
thm
rainplot
Dendrograms can be added to ggplot2 plots, but it can be quite complicated. If one wants to add dendrograms to a clustered rainplot, there will be some trial and error to get everything properly aligned. Dendrograms will be created using the ggdendro package.
library(ggdendro)
# Extract dendrogram data from previous cluster results
dendro_dat <- segment(dendro_data(clust))
# basic dendrogram
dendro <-
ggplot(dendro_dat) +
geom_segment(aes(x = x, y = y, xend=xend, yend=yend), colour = 'black') +
theme_dendro()
dendro
The default dendrogram points down. To put the dendrogram on the left of our rain plot, we want it to point to the right. We can do this by switching the x and y coordinates.
dendro <-
ggplot(dendro_dat) +
geom_segment(aes(x = -y, y = x, xend=-yend, yend=xend), colour = 'black') +
theme_dendro()
dendro
We will use the gridExtra package to display our plots together. Though our dendrogram looks okay, there will be alignment issues when we first plot the dendrogram and rainplot side by side.
library(gridExtra)
grid.arrange(dendro, rainplot, ncol = 2, widths = c(3, 15))
As we can see, the dendrogram is not aligned with the rainplot labels. To get around this, we will use a hack. First, we will plot a modified version of the rainplot, selecting only the column with the longest label, under the dendrogram. This will align the plot area between the dendrogram and the rainplot. Second, we will make the modified rainplot invisible, so that the dendrogram appears on its own.
x_labels <-
plot_data$response %>%
unique()
longest_x_label <-
x_labels[[which.max(nchar(x_labels))]]
modified_rainplot_data <-
plot_data %>%
filter(response == longest_x_label)
dendro <-
ggplot() +
# One-column rainplot. White Points to be invisible.
geom_point(aes(x = response, y = term, size = p.value),
colour = 'white',
data = modified_rainplot_data) +
scale_x_discrete(position = 'top', expand = c(0, 3, 0, 0.1)) +
thm +
# Make rainplot invisible. Remove elements that don't affect alignemnt Make
# invisible (match background) elemnts that do affect alignment.
theme(legend.position = 'none',
axis.text.y = element_blank(),
axis.text.x = element_text(colour = 'white'),
panel.border = element_rect(fill = NA, colour = 'white')
) +
# Draw dendrogram
geom_segment(aes(x = (-y + 1), y = x, xend=(-yend + 1), yend=xend),
colour = 'black',
data = dendro_dat)
grid.arrange(dendro, rainplot, ncol = 2, widths = c(3, 15))
To better understand how this works, lets walk through it step by step. To make what is going on clearer, many plot elements will initially be left visible. We start by plotting the modified rainplot next to the full rainplot.
x_labels <-
plot_data$response %>%
unique()
longest_x_label <-
x_labels[[which.max(nchar(x_labels))]]
modified_rainplot_data <-
plot_data %>%
filter(response == longest_x_label)
dendro <-
ggplot() +
# One-column rainplot.
geom_point(aes(x = response, y = term, size = p.value),
colour = 'black',
data = modified_rainplot_data) +
scale_x_discrete(position = 'top') +
thm +
# Remove legend and y axis text
theme(legend.position = 'none',
axis.text.y = element_blank()
)
grid.arrange(dendro, rainplot, ncol = 2, widths = c(3, 15))
Looking at the plots side by side, we see that the plot areas are aligned. This will help align the dendrogram to the labels.
# The `+ 1` aligns the tips of the dendrograms to the plot points. This will be
# important when eliminating white space.
dendro <-
dendro +
geom_segment(aes(x = (-y + 1), y = x, xend=(-yend + 1), yend=xend),
colour = 'black',
data = dendro_dat)
grid.arrange(dendro, rainplot, ncol = 2, widths = c(3, 15))
When we add the dendrogram, we see that it is perfectly aligned to the rainplot labels. However, some of the dendrogram is outside the plotted area to the left. Additionally, there is too much white space to the right of the plot. Both those issues can be rectified through the expand argument. Note that the first non-zero argument to expand will depend on the size of the dendrogram, and will take some trial and error to discover. The second non-zero argument determines how close the right plot edge is to the bottom of the dendrogram, and is mostly a design preference.
dendro <-
dendro +
scale_x_discrete(position = 'top', expand = c(0, 3, 0, 0.1))
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
grid.arrange(dendro, rainplot, ncol = 2, widths = c(3, 15))
To finalize the plot, we make invisible the remaining visual elements not a part of the dendrogram.
dendro <-
dendro +
geom_point(aes(x = response, y = term, size = p.value),
colour = 'white',
data = modified_rainplot_data) +
theme(axis.text = element_text(color = 'white'),
panel.border = element_rect(fill = NA, color = 'white'))
grid.arrange(dendro, rainplot, ncol = 2, widths = c(3, 15))